Today we will continue the pumpkin case study from last week. We will use the data that you cleaned and split last time (pumpkins_train and pumpkins_test) and will compare today’s results to those you obtained then, so as a first step open and run your Lab 1 .Rmd so those objects are available in your Environment (unless you created an R Project last time, in which case, kudos to you!).
Once you have done that, we’ll start today’s lab by specifying a recipe for a polynomial model: the recipe identifies our variables and data, converts package to numerical form, and then adds a polynomial effect with step_poly().
# Specify a recipe
poly_pumpkins_recipe <-
  recipe(price ~ package, data = pumpkins_train) %>%
  step_integer(all_predictors(), zero_based = TRUE) %>%
  step_poly(all_predictors(), degree = 4)
How did that work? Choose another value for degree if you need to. Later we will learn about model tuning that will let us do things like find the optimal value for degree. For now, we’d like to have a flexible model, so find the highest value for degree that is consistent with our data.
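To see why the data caps the usable degree, here is a self-contained base-R sketch with synthetic data (not the pumpkins set): poly(x, degree) requires degree to be strictly less than the number of unique x values, so an integer-encoded predictor with five levels supports at most degree 4.

```r
# Synthetic illustration: an encoded predictor with 5 unique levels
# supports at most a degree-4 polynomial; degree 5 would error.
set.seed(42)
x <- rep(0:4, each = 20)                      # 5 unique encoded levels
y <- 15 + 8 * x - 1.5 * x^2 + rnorm(100, sd = 2)

for (d in 1:4) {
  fit <- lm(y ~ poly(x, degree = d))
  cat("degree", d, "in-sample RMSE:",
      round(sqrt(mean(residuals(fit)^2)), 2), "\n")
}
```

Because the models are nested, the in-sample RMSE can only shrink as degree grows, which is why we evaluate on held-out data before trusting a flexible model.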
Polynomial regression is still linear regression, so our model specification looks similar to before.
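If it helps to see why, here is a small base-R illustration (a generic x, not the pumpkin data): poly() just adds derived columns to the design matrix, and the model remains linear in its coefficients.

```r
# A degree-3 "polynomial regression" is an ordinary linear model whose
# design matrix carries extra orthogonal-polynomial columns.
x <- 0:9
head(model.matrix(~ poly(x, degree = 3)))  # intercept + 3 derived columns
```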
# Create a model specification called poly_spec
poly_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
Question 1: Now take the recipe and model specification that you just created and bundle them into a workflow called poly_wf.
# Question 1. Bundle recipe and model spec into a workflow
poly_wf <- workflow() |>
  add_recipe(poly_pumpkins_recipe) |>
  add_model(poly_spec)
print("Answer 1. See above workflow.")
## [1] "Answer 1. See above workflow."
Question 2: Fit a model to the pumpkins_train data using your workflow and assign it to poly_wf_fit.
# Question 2. Fit a model
poly_wf_fit <- poly_wf |>
  fit(data = pumpkins_train)
print("Answer 2. Above, you'll see code for how we fit the model to the pumpkins training dataset. Below, you'll see the corresponding model coefficients, then subsequently 10 instances of model predictions of price (.pred) on the test set compared to actual values of price (price).")
## [1] "Answer 2. Above, you'll see code for how we fit the model to the pumpkins training dataset. Below, you'll see the corresponding model coefficients, then subsequently 10 instances of model predictions of price (.pred) on the test set compared to actual values of price (price)."
# Print learned model coefficients
poly_wf_fit
## ══ Workflow [trained] ══════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
##
## • step_integer()
## • step_poly()
##
## ── Model ───────────────────────────────────────────────────────────────────────
##
## Call:
## stats::lm(formula = ..y ~ ., data = data)
##
## Coefficients:
## (Intercept) package_poly_1 package_poly_2 package_poly_3 package_poly_4
## 27.9706 103.8566 -110.9068 -62.6442 0.2677
# Make price predictions on test data
poly_results <- poly_wf_fit %>%
  predict(new_data = pumpkins_test) %>%
  bind_cols(pumpkins_test %>% select(c(package, price))) %>%
  relocate(.pred, .after = last_col())
# Print the results
poly_results %>%
  slice_head(n = 10)
## # A tibble: 10 × 3
## package price .pred
## <chr> <dbl> <dbl>
## 1 1 1/9 bushel cartons 13.6 15.9
## 2 1 1/9 bushel cartons 16.4 15.9
## 3 1 1/9 bushel cartons 16.4 15.9
## 4 1 1/9 bushel cartons 13.6 15.9
## 5 1 1/9 bushel cartons 15.5 15.9
## 6 1 1/9 bushel cartons 16.4 15.9
## 7 1/2 bushel cartons 34 34.4
## 8 1/2 bushel cartons 30 34.4
## 9 1/2 bushel cartons 30 34.4
## 10 1/2 bushel cartons 34 34.4
Now let’s evaluate how the model performed on the test set using yardstick::metrics().
metrics(data = poly_results, truth = price, estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 3.27
## 2 rsq standard 0.892
## 3 mae standard 2.35
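As a refresher on what these three numbers mean, they can be reproduced by hand from any pair of truth/prediction vectors. Here is a base-R sketch with made-up values (not the pumpkin results):

```r
# Hand-computed RMSE, R-squared, and MAE for illustrative vectors,
# mirroring what yardstick::metrics() reports for regression.
truth <- c(13.6, 16.4, 16.4, 34.0, 30.0)
pred  <- c(15.9, 15.9, 15.9, 34.4, 34.4)

rmse <- sqrt(mean((truth - pred)^2))  # penalizes large errors more
mae  <- mean(abs(truth - pred))       # average absolute error
rsq  <- cor(truth, pred)^2            # squared correlation
round(c(rmse = rmse, rsq = rsq, mae = mae), 3)
```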
Question 3: How do the performance metrics differ between the linear model from last week and the polynomial model we fit today? Which model performs better on predicting the price of different packages of pumpkins?
Let’s visualize our model results. First prep the results by binding the encoded package variable to them.
# Question 3
print("Answer 3. These metrics show that the polynomial model fits the data better than the linear model: RMSE and MAE are approximately halved relative to last week's model, and R-squared is approximately doubled. In other words, the polynomial model's prediction errors are much smaller, and it explains much more of the variability in price.")
## [1] "Answer 3. These metrics show that the polynomial model fits the data better than the linear model: RMSE and MAE are approximately halved relative to last week's model, and R-squared is approximately doubled. In other words, the polynomial model's prediction errors are much smaller, and it explains much more of the variability in price."
# Bind encoded package column to the results
poly_results <- poly_results |>
  bind_cols(package_encode |>
              rename(package_integer = package)) |>
  relocate(package_integer, .after = package)
# Print new results data frame
poly_results %>%
  slice_head(n = 5)
## # A tibble: 5 × 4
## package package_integer price .pred
## <chr> <int> <dbl> <dbl>
## 1 1 1/9 bushel cartons 0 13.6 15.9
## 2 1 1/9 bushel cartons 0 16.4 15.9
## 3 1 1/9 bushel cartons 0 16.4 15.9
## 4 1 1/9 bushel cartons 0 13.6 15.9
## 5 1 1/9 bushel cartons 0 15.5 15.9
OK, now let’s take a look!
Question 4: Create a scatter plot that takes the poly_results and plots package vs. price. Then draw a line showing our model’s predicted values (.pred). Hint: you’ll need separate geoms for the data points and the prediction line.
# Make a scatter plot
print("Answer 4. See the scatter plot below for the relationship between pumpkin package and price. The orange line represents the model's predicted values.")
## [1] "Answer 4. See the scatter plot below for the relationship between pumpkin package and price. The orange line represents the model's predicted values."
poly_results %>%
  ggplot(mapping = aes(x = package_integer, y = price)) +
  geom_point() +
  geom_line(mapping = aes(y = .pred), color = "orange", size = 1.2)
You can see that a curved line fits your data much better.
Question 5: Now make a smoother line by using geom_smooth instead of geom_line and passing it a polynomial formula like this: geom_smooth(method = lm, formula = y ~ poly(x, degree = 3), color = "midnightblue", size = 1.2, se = FALSE)
# Make a smoother scatter plot
print("Answer 5. See the scatter plot below for the relationship between pumpkin package and price. This time, the blue line represents a 3rd-degree polynomial regression that models the data.")
## [1] "Answer 5. See the scatter plot below for the relationship between pumpkin package and price. This time, the blue line represents a 3rd-degree polynomial regression that models the data."
poly_results %>%
  ggplot(mapping = aes(x = package_integer, y = price)) +
  geom_point() +
  geom_smooth(method = lm, formula = y ~ poly(x, degree = 3), color = "midnightblue", size = 1.2, se = FALSE)
OK, now it’s your turn to go through the process one more time.
Additional assignment components:
print("Answer 6. I am selecting variety as my new predictor variable since this variable is highly correlated with price.")
## [1] "Answer 6. I am selecting variety as my new predictor variable since this variable is highly correlated with price."
# Correlation between the integer-encoded variety and price
cor_variety <- cor(baked_pumpkins$variety, baked_pumpkins$price)
cor_variety
## [1] -0.863479
print(paste0("Answer 7. The correlation between pumpkin variety and pumpkin price is approximately -", round(abs(cor_variety), 2), ", a strong negative correlation."))
## [1] "Answer 7. The correlation between pumpkin variety and pumpkin price is approximately -0.86, a strong negative correlation."
# create recipe
poly_pumpkins_recipe_variety <-
  recipe(price ~ variety, data = pumpkins_train) %>%
  step_integer(all_predictors(), zero_based = TRUE) %>%
  step_poly(all_predictors(), degree = 3)
# Create model specification
poly_spec_variety <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
# Bundle recipe and model spec into a workflow
poly_wf_variety <- workflow() |>
  add_recipe(poly_pumpkins_recipe_variety) |>
  add_model(poly_spec_variety)
# Fit a model
poly_wf_fit_variety <- poly_wf_variety |>
  fit(data = pumpkins_train)
# Make price predictions on test data
poly_results_variety <- poly_wf_fit_variety %>%
  predict(new_data = pumpkins_test) %>%
  bind_cols(pumpkins_test %>% select(c(variety, price))) %>%
  relocate(.pred, .after = last_col())
# Print the results
poly_results_variety %>%
  slice_head(n = 10)
## # A tibble: 10 × 3
## variety price .pred
## <chr> <dbl> <dbl>
## 1 PIE TYPE 13.6 16.2
## 2 PIE TYPE 16.4 16.2
## 3 PIE TYPE 16.4 16.2
## 4 PIE TYPE 13.6 16.2
## 5 PIE TYPE 15.5 16.2
## 6 PIE TYPE 16.4 16.2
## 7 MINIATURE 34 34.2
## 8 MINIATURE 30 34.2
## 9 MINIATURE 30 34.2
## 10 MINIATURE 34 34.2
# Build a recipe
lm_pumpkins_recipe_variety <- recipe(price ~ variety, data = pumpkins_train) %>%
  step_integer(all_predictors(), zero_based = TRUE)
# Encode variety column
variety_encode <- lm_pumpkins_recipe_variety %>%
  prep() %>%
  bake(new_data = pumpkins_test) %>%
  select(variety)
# Bind encoded variety column to the results
poly_results_variety <- poly_results_variety |>
  bind_cols(variety_encode |>
              rename(variety_integer = variety)) |>
  relocate(variety_integer, .after = variety)
# Print new results data frame
poly_results_variety %>%
  slice_head(n = 5)
## # A tibble: 5 × 4
## variety variety_integer price .pred
## <chr> <int> <dbl> <dbl>
## 1 PIE TYPE 3 13.6 16.2
## 2 PIE TYPE 3 16.4 16.2
## 3 PIE TYPE 3 16.4 16.2
## 4 PIE TYPE 3 13.6 16.2
## 5 PIE TYPE 3 15.5 16.2
metrics(data = poly_results_variety, truth = price, estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 4.13
## 2 rsq standard 0.827
## 3 mae standard 2.52
poly_results_variety %>%
  ggplot(mapping = aes(x = variety_integer, y = price)) +
  geom_point() +
  geom_line(mapping = aes(y = .pred), color = "orange", size = 1.2)
Lab 2 due 1/24 at 11:59 PM